1. INTRODUCTION

Sport is an activity that involves the active movement of athletes, either as a team or as
individuals, competing against an opposing side [1]. It is one of the measures of a country's
development, as a winning team brings a good reputation to its country [2]. To help athletes achieve
high performance, coaches and sports professionals play a vital role in evaluating
and training them [3]. Among various methods, sport video analysis is one way to evaluate
the performance level of athletes and to enhance training techniques [4]. The process of assessing
the performance level of an athlete is known as performance analysis [5]. Performance analysis can be
divided into two types: technical analysis, and tactical or notational analysis [6]. Technical analysis
answers the question of how the game is performed by the athletes, whereas
tactical or notational analysis answers the question of what activity is performed [7].

Over the past few years, different approaches have been implemented in the field of computer vision
to analyse sport videos. In the early stages, many researchers proposed traditional handcrafted
approaches for sport video analysis [8]. More recently,
the deep learning approach has become the topic of interest in sport video analysis owing to its successful
performance in other areas such as human-computer interfaces, handwriting recognition
and speech recognition [9]. Generally, sport video analysis using this approach can be divided into several
directions, such as player and ball tracking, trajectory detection and action recognition. This paper
reviews current trends in deep learning approaches to human activity recognition-based sport video
analysis. The remainder of this paper is organized as follows. Section 2 describes the traditional
handcrafted architecture. Section 3 presents the deep learning architecture. Section 4 discusses
the drawbacks of state-of-the-art works, and Section 5 presents the conclusion and future work.


2. TRADITIONAL HANDCRAFTED ARCHITECTURE 
This section gives an overview of handcrafted architecture and reviews its application in sport video
analysis based on prior research.


2.1. Overview of handcrafted architecture 

Before the emergence of deep learning architecture, sport video
analysis was mainly performed using handcrafted features [9]. Traditional handcrafted features are
features designed by humans for specific problems [10]. They consist of feature descriptors and feature
extractors that are used to manually extract the important information from the sport video for further
analysis. However, these traditional handcrafted features can only perform low-level feature
extraction [11]. The rest of this section presents the traditional handcrafted architectures that have
been widely used in computer vision to recognize activity in video.

In the early days of sport video analysis, action recognition relied on
manually designed features to extract and represent the information needed, a
technique known as the handcrafted approach. At first, low-level action features such as histogram of optical
flow (HOF), histogram of oriented gradients (HOG) and sparse spatio-temporal interest points (STIP) were
designed to extract features from videos for analysis [9]. Then, those sparse features were extended to dense
spatio-temporal features [12]. Later, improved dense trajectories (iDT) were introduced, which
combine optical flow and speeded up robust features (SURF) [13]. To further enhance iDT, the multi-layer
stacked Fisher vector (FV) was developed [14]. Multi-skip feature stacking was also used for feature
extraction in videos, which improved the performance of action recognition [15]. Although
these handcrafted features evolved over time and improved in performance, each could only
perform well on a specific problem and was not flexible enough to be used on other datasets. Table 1 lists
some of the popular handcrafted features used in action recognition for video analysis, including sport
video analysis.


Table 1. List of popular handcrafted features used for action recognition in videos


Handcrafted features                              Reference
Histogram of Optical Flow (HOF) and               [16, 17]
  Histogram of Oriented Gradients (HOG)
Sparse Spatio-Temporal Interest Points (STIP)     [18]
Improved Dense Trajectories (iDT)                 [13]
Fisher Vector (FV)                                [14]
Multi-skip Feature Stacking (MIFS)                [15]
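
To make the idea of a low-level handcrafted descriptor concrete, the following is a minimal sketch of a
HOG-style orientation histogram for a single image cell. It is a simplification under stated assumptions:
gradients are unsigned (0 to 180 degrees), each pixel votes into one bin only, and block normalisation is
omitted. The function name and the test patch are hypothetical and are not taken from any of the cited works.

```python
import math

def hog_cell_histogram(cell, n_bins=9):
    """Orientation histogram of gradients for one grayscale cell (list of rows)."""
    height, width = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins  # unsigned gradients cover 0..180 degrees
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes for its orientation bin, weighted by gradient magnitude.
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

# Hypothetical 4x4 patch containing a vertical edge (intensity jumps from 0 to 10).
patch = [[0, 0, 10, 10]] * 4
hist = hog_cell_histogram(patch)
```

For this patch every interior gradient is horizontal, so all of the magnitude falls into the first bin. The
hand-tuned choices here (bin count, gradient operator) illustrate why such features are problem-dependent.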


2.2. Sport video analysis using handcrafted features 

During the early stages of video analysis in computer vision, including sports, handcrafted features
were used as mentioned earlier. For instance, Abdulmunem et al. [19] proposed a method for action
recognition based on salient object spotting using local and global descriptors on the KTH and UCF-Sports
datasets. Their method used handcrafted 3D-SIFT-HOOF (SGSH) features, which were encoded with a bag of
visual words approach and then passed to an SVM. Carbonneau et al. [20] presented a
novel technique to identify play-break events in hockey videos. They used bag-of-words event detectors to
identify key events such as line changes and play breaks; STIP were used, and the final decision was made
by creating a context descriptor. Figure 1 shows the schematic block diagram of the event detector used in [20].
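
The bag of visual words encoding mentioned above can be sketched as follows. This is an illustrative
example only, not the implementation of [19] or [20]: the codebook is a hypothetical hand-written list
(in practice it would usually be learned, for example by k-means clustering), and the descriptors are
made-up 2D vectors. Each local descriptor is assigned to its nearest codeword, and the clip is represented
by the normalised histogram of assignments, which could then be passed to a classifier such as an SVM.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))

def bovw_histogram(descriptors, codebook):
    """L1-normalised histogram of codeword assignments for one video clip."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts) or 1  # avoid division by zero for an empty clip
    return [c / total for c in counts]

# Hypothetical codebook and local descriptors (assumed values for illustration).
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
clip_descriptors = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.7, 5.2]]
clip_vector = bovw_histogram(clip_descriptors, codebook)  # [0.25, 0.5, 0.25]
```

The resulting fixed-length vector discards the order of the descriptors, which is one reason such encodings
capture only low-level information.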

Zhu et al. [7] introduced a novel method to detect events in soccer games in order to extract tactical
information. They employed multi-object trajectories and the field locations of the event shots
to recognize semantic events in soccer games. Moreover, Lee et al. [21] implemented pattern matching
techniques to automatically recognize events in soccer games using a multi-object tracking unit and a motion
recognizer. Their proposed block diagram is shown in Figure 2.

Lien et al. [22] studied scene-based event detection for baseball videos by implementing various
handcrafted features, such as image-based features, object-based features and global motion, to capture
semantic information from baseball videos. The extracted features were then fed into a hidden
Markov model (HMM) to classify the detected events. Chen et al. [23] developed a novel handcrafted model
based on a player trajectory reconstruction method, which includes court detection, player detection, player
tracking and homography transformation in broadcast basketball videos.
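
As an illustration of the homography transformation step named above, the sketch below maps an image
point to court coordinates with a 3x3 matrix. It is a generic example under stated assumptions, not the
procedure of [23]: the matrix values are hypothetical (a simple scale and shift), whereas a real system
would estimate the matrix from detected court lines or landmarks.

```python
def apply_homography(H, point):
    """Map an image point (x, y) to court coordinates with a 3x3 homography H."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return (u / w, v / w)  # divide by w to leave homogeneous coordinates

# Hypothetical matrix: scale image coordinates by 0.5 and shift by (10, 20).
H = [[0.5, 0.0, 10.0],
     [0.0, 0.5, 20.0],
     [0.0, 0.0, 1.0]]
court_xy = apply_homography(H, (100.0, 40.0))  # (60.0, 40.0)
```

Applying the same mapping to each tracked player position in every frame yields trajectories in a common
court coordinate system.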
Figure 1. Schematic block diagram of event detectors in [20]
Figure 2. Block diagram of soccer event detection system [21]


Lu et al. [24] constructed a system to automatically track multiple hockey players and
recognize their activities concurrently in broadcast hockey video. In this study, HOG descriptors were
used to detect the players, and a probabilistic framework with a mixture of local subspaces was
designed to model the appearance of the players. Additionally, a boosted particle filter (BPF) was employed
in the tracking system. Mukherjee et al. [25] proposed a novel descriptor based on improved dense
trajectories for human action and event recognition. Their design also used the Fisher vector (FV), and the
proposed approach was trained with a binary support vector machine (SVM). The datasets used in this
research were UCF Sports, CMU Mocap and Hollywood2.


3. DEEP LEARNING 
This section gives an overview of deep learning architecture and reviews its application in sport video
analysis based on previous studies.


3.1. Overview of deep learning architecture 

Deep learning architecture works directly on raw input images or video
frames, passing them through multiple processing layers to learn and represent the data [26]. Unlike
traditional handcrafted architecture, it does not require any manually designed feature descriptors or
feature extractors. For instance, a deep learning architecture uses local perception, down pooling, weight
sharing, multiple convolution kernels, etc. to automatically learn local features from a segment of an image
rather than the whole image [27]. Deep learning techniques are able to classify high-level or complex
actions, which has attracted considerable research interest [28]. Examples of widely used deep learning
models are the convolutional neural network (CNN), recurrent neural network (RNN) and long short-term
memory (LSTM).
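
Local perception, weight sharing and down pooling can be illustrated with a small sketch. The example
below slides one shared kernel over an image (as deep learning libraries typically do, this is a
cross-correlation without kernel flipping) and then applies 2x2 max pooling. The kernel values and the
image are hypothetical; in a trained CNN the kernel weights would be learned from data rather than fixed
by hand.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    # Each output value only sees a kh x kw patch (local perception),
    # and the same kernel weights are reused at every position (weight sharing).
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]

def max_pool2x2(fmap):
    """Down-sample a feature map by taking the maximum of each 2x2 block."""
    return [[max(fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1])
             for x in range(0, len(fmap[0]) - 1, 2)]
            for y in range(0, len(fmap) - 1, 2)]

# Hypothetical 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_kernel = [[-1, 1],
               [-1, 1]]  # assumed fixed kernel; a CNN would learn these weights
feature_map = conv2d_valid(image, edge_kernel)  # 4x4 map, responds at the edge
pooled = max_pool2x2(feature_map)               # 2x2 map
```

The feature map is non-zero only where the edge lies, and pooling keeps that response while halving the
resolution.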

The excellent accuracy of deep learning in visual tasks inspired
its use in video analysis. Initially, a CNN was applied independently to extract information
from still images [29]. However, a 2D-CNN fails to extract temporal information from video sequences.
To overcome this issue, the 3D-CNN was constructed to extract both spatial and temporal information
from video frames [30]. Following this, the RNN was used in action recognition. RNN-based methods effectively
capture temporal information because the current prediction is based on both current observations and
past information [31]. Nevertheless, the RNN architecture only has short-term memory, which limits its
use in real-world scenarios. To alleviate this problem, the LSTM model was proposed. This model is able to
extract temporal information from sequential video data because it has a memory unit that decides when
to remember and when to forget hidden states [32]. Owing to this advantage, the LSTM model is broadly used
in computer vision applications such as action recognition. Table 2 shows a comparison between deep
learning models.


Table 2. Comparison between deep learning models


Model    Advantage                                  Drawback
2D-CNN   Automatically captures the spatial         Cannot capture the temporal
         information in the image patches           information in video data
3D-CNN   Automatically captures both spatial        Expensive model due to 3D
         and temporal information
RNN      Automatically captures the temporal        Has short memory ability, could not
         information in sequential data             be applied in real situations;
                                                    gradient explosion
LSTM     Automatically captures the temporal        NIL
         information in sequential data
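
The memory unit of the LSTM, which decides when to remember and when to forget, can be sketched with a
single scalar cell. This is a minimal illustration of the standard gate equations and is not drawn from any
cited implementation; the parameter values are hypothetical and chosen so that the forget gate stays almost
fully open and the input gate almost closed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params holds (w_x, w_h, b) for each gate."""
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)
    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # proposed new content
    c = f * c_prev + i * g             # updated cell (memory) state
    h = o * math.tanh(c)               # updated hidden state
    return h, c

# Hypothetical parameters: large forget bias (remember), very negative input bias (ignore input).
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = 0.0, 1.0
for x in [0.5, -0.3, 0.8, 0.1, -0.9]:
    h, c = lstm_step(x, h, c, params)
# c remains close to its initial value of 1.0 despite the varying inputs.
```

With these settings the cell state is carried across the whole sequence almost unchanged, which is the
long-term memory behaviour that a plain RNN lacks.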


3.2. Deep learning architecture in sport video analysis 

Beyond its strong performance in applications such as voice recognition and text recognition,
the deep learning approach has also achieved outstanding results in sport video
analysis in recent years. Although its application to sport video analysis is still at an early stage and
little research has been done, its performance so far is more accurate than that of traditional
approaches, and it is attracting increasing attention [33]. Tora et al. [34] proposed a novel deep learning
approach for classifying puck possession events in ice hockey. They used a pre-trained CNN to first extract
the features and then used an LSTM to classify five types of events: dump in, dump out,
loose puck recovery, pass and shot. Sozykin et al. [35] presented a 3D CNN based action recognition system
for the multi-class imbalance problem in ice hockey. They first extracted features from both single images
and slices of frames using a CNN. Then, to address the multi-class imbalance, they compared two strategies:
an ensemble of k independent single-label learning
networks and a single multi-label k-output network. The single multi-label k-output model was found to
achieve higher performance than the ensemble model. Longteng et al. [36] designed a joint
framework for both athlete tracking and action recognition in sport videos, using a scaling
and occlusion robust tracker with a spatial pyramid pooling network (SPP-net) [37] and an LSTM to extract
motion, spatial and temporal features.
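
The idea of a single multi-label k-output network can be sketched at the level of its output layer. The
example below is a generic illustration, not the model of [35]: it assumes one independent sigmoid per class
and an unweighted binary cross-entropy loss, and the logits and targets are hypothetical values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def multilabel_predict(logits, threshold=0.5):
    """Independent sigmoid per class, so several labels can be active at once."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]

def binary_cross_entropy(logits, targets):
    """Mean binary cross-entropy over the k outputs of one sample."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(logits)

# Hypothetical logits for k = 3 action classes on one clip.
logits = [2.0, -1.0, 0.3]
labels = multilabel_predict(logits)                       # [1, 0, 1]
loss_match = binary_cross_entropy(logits, [1, 0, 1])      # targets agree with prediction
loss_mismatch = binary_cross_entropy(logits, [0, 1, 0])   # targets disagree
```

An ensemble of k single-label networks would instead train k separate models, one per class, and combine
their individual decisions.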

Jiang et al. [38] introduced deep learning based automatic soccer video event detection using a CNN
and an RNN. The paper focused on four types of event: goal, goal attempt, corner and card. The CNN model
was used to extract image features, and the RNN structure was used to capture temporal relations.
Three different RNN structures (the traditional RNN, LSTM and gated
recurrent unit (GRU) models) were compared to determine the optimum model, and LSTM performed best
among the three. Hong et al. [39] presented deep transfer learning based end-to-end soccer
video scene and event classification. Scene classification includes long view and close-up view, while event
classification includes corner, free-kick, goal and penalty. Their classification model used CNN models
together with a transfer learning method. Yu et al. [40] designed a deep learning-based soccer event
detection system that covers event detection as well as story generation (see Figure 3), where a story
begins with event clips and ends with the replay event. In this design, a CNN and an LSTM were used.

In tennis, Mora et al. [41] proposed a domain-specific deep learning action
recognition method that uses a pre-trained CNN with a three-layered LSTM model. The paper aimed to
recognise fine-grained actions in tennis. The deep LSTM network used in this research is able to learn
high-level structures and provides high accuracy. Ibrahim et al. [42] developed a deep model
to extract dynamic temporal information in volleyball using LSTM models. The game state was deduced by
capturing the players’ states through a two-stage LSTM model. Ramanathan et al. [43] established a deep
learning method to detect and categorise basketball events using an RNN. This work first used an attention
model to detect key players in multi-person videos, and then used a CNN and an LSTM to extract features
and detect events in basketball.
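
The role of an attention model in picking out key players can be illustrated with a small sketch. This is
a generic softmax attention example under stated assumptions, not the architecture of [43]: the per-player
feature vectors and relevance scores are hypothetical, and in a real system both would be produced by
learned networks.

```python
import math

def softmax(scores):
    """Convert raw relevance scores into weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(player_features, scores):
    """Weighted sum of per-player feature vectors using softmax attention weights."""
    weights = softmax(scores)
    dim = len(player_features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, player_features))
              for d in range(dim)]
    return pooled, weights

# Hypothetical features for three players and assumed relevance scores.
player_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attend(player_features, [5.0, 0.0, 0.0])
# The first player receives most of the weight, so pooled is close to [1.0, 0.0].
```

The pooled vector summarises the frame with emphasis on the highest-scoring player and can then be passed
to a sequence model for event detection.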
Figure 3. Proposed soccer story detection framework [40]


4. DISCUSSION 

Over the past few decades, a considerable amount of research on sport video analysis has been based on
handcrafted architectures. However, handcrafted architecture, which consists of feature descriptors
and extractors, is problem-dependent. In other words, handcrafted features can only be applied to specific
problems and cannot be reused for other datasets or problems, which makes them impractical in real scenarios.
Apart from that, the traditional handcrafted architecture used for sport video analysis can only extract
low-level features and has great difficulty capturing high-level semantic information [44].
These limitations, together with advances in technology, led to the emergence of deep learning-based sport
video analysis. Since deep learning-based sport video analysis is still a new and growing research field,
only a few studies were found. Most of the studies focus on soccer, tennis, baseball and basketball [45].
Only limited resources are available on other sports such as hockey and badminton. The main
drawback of deep learning models is that they are data-hungry: to learn features automatically from
raw input data, they depend on thousands of input samples [46]. Besides, to raise the performance
of a deep learning model, high-performance GPUs are needed [47]. However, recent advances in technology
and the growth of big data have helped to overcome these issues.

Previously, the CNN model, one of the deep learning approaches, has shown tremendous success
in image recognition, speech recognition, audio recognition, etc. [48]. However, in the analysis of video
input data, researchers face many challenges because video sequences evolve dynamically with time, which
makes it difficult to extract temporal information. Continued study of video analysis using deep learning
led to the establishment of sequential models such as RNN and LSTM. These models are able to extract temporal
information from video input data. Some research has combined CNN
and LSTM models to extract spatio-temporal information, but only a few studies have addressed the extraction
of high-level semantic information in sport video analysis. Despite the astonishing performance of deep
learning based architecture, the advances achieved in image classification have not yet been matched in
fields such as video classification or sport video analysis [49]. This remains an open issue in deep
learning-based research that many researchers are trying to solve [50].


5. CONCLUSION 

This paper contributes a comprehensive survey of sport video analysis by comparing the
handcrafted and deep learning approaches. In summary, the deep learning approach has overcome the limitations
encountered by traditional methods in activity recognition for sport video analysis. However, only a few
studies have focused on sport video analysis. Future studies can therefore focus on extracting
high-level semantic information in sport analysis, which can be used by coaches and sports professionals to
evaluate players’ tactical performance in the game. Moreover, future research should also concentrate more
on sports other than soccer, tennis, baseball and basketball, as almost 80% of prior
research has focused on those sports.